This notebook is a port to pymc3 of the example given in the Metrum Institute's MI250: Introduction to Bayesian PK-PD Modelling and Simulation (2010) Lab 2 session.
It illustrates the ease with which Bayesian models can be created and analysed with pymc3.
In this notebook we build a simple Emax model in pymc3 and look at a variety of ways to visually assess model fit and make predictions for new values.
https://youtu.be/AQDXRoBan6Y?list=PLvLDbH2lpyXNGV8mpBdF7EFK9LQJzGL-Y
In [440]:
from pymc3 import Model, autocorrplot, Normal, Lognormal, sample_ppc, Uniform, traceplot, df_summary, trace_to_dataframe, find_MAP, NUTS, sample, summary
import numpy as np
import pandas as pd
import seaborn as sns
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
data = pd.read_csv('/Lab2.csv')
In [4]:
mean_df = data[['cobs', 'fxa.inh.obs']].groupby(data['subject']).mean()
cobs = np.array(mean_df['cobs'])
fxa_in = np.array(mean_df['fxa.inh.obs'])
In [5]:
data.head()
Out[5]:
The data are the results from a phase I, single-dose clinical trial in healthy volunteers. There were 8 subjects per dose arm, with doses of 1.25, 5, 10, 15, 20, 30, 40, 60 and 80 mg. In each subject, the drug concentration and the fxa.inh.obs biomarker were measured at the same time points over the course of the study.
In [6]:
plasma = sns.FacetGrid(data, col="dose", hue="subject", col_wrap=3, size=3)
plasma.map(plt.plot, "time", "cobs", marker="o", ms=4)
Out[6]:
In [7]:
effects = sns.FacetGrid(data, col="dose", hue="subject", col_wrap=3, size=3)
effects.map(plt.plot, "time", "fxa.inh.obs", marker="o", ms=4)
Out[7]:
In [8]:
relationship = sns.FacetGrid(data, col="dose", hue="subject", col_wrap=3, size=3, sharex=False)
relationship.map(plt.plot, "cobs", "fxa.inh.obs", marker="o", ms=4)
Out[8]:
In [9]:
mean_df = data[['cobs', 'fxa.inh.obs']].groupby(data['subject']).mean()
In [10]:
sns.lmplot(x="cobs", y="fxa.inh.obs", data=mean_df, lowess=True)
Out[10]:
I have used a lowess fit as a simple way to roughly plot the relationship between mean drug plasma concentration 'cobs' and the fxa biomarker.
Now we will create the model, with fairly uninformative priors. Notice that I have also created a dummy variable called fxa_pred. This will create len(cobs) = 80 traces, which we will use later as one of the methods for assessing the fit of the model.
We are fitting a simple Emax model to the data.
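Concretely, the sigmoid Emax model expresses the effect at concentration $C$ as

$$E(C) = \frac{E_{max}\,C^{\gamma}}{EC_{50}^{\gamma} + C^{\gamma}}$$

where $E_{max}$ is the maximal effect, $EC_{50}$ is the concentration producing half-maximal effect, and $\gamma$ is the Hill (sigmoidicity) coefficient. This is exactly the fxaMean expression in the model code below.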
In [439]:
pkpd_model = Model()
with pkpd_model:
    # Priors for unknown model parameters
    emax = Uniform('emax', lower=0, upper=100)
    ec50 = Lognormal('ec50', mu=0.01, tau=5)
    gamma = Uniform('gamma', lower=0, upper=10)
    sigma = Uniform('sigma', lower=0, upper=1000)
    # Expected value of outcome
    fxaMean = (emax * (cobs**gamma)) / ((ec50**gamma) + (cobs**gamma))
    # Likelihood (sampling distribution) of observations
    fxa = Normal('fxa', mu=fxaMean, sd=sigma, observed=fxa_in)
    # Dummy fxa variable that will hold the model predictions for each value of cobs
    fxa_pred = Normal('fxa_pred', mu=fxaMean, sd=sigma, shape=len(cobs))
First we find some reasonable starting values using find_MAP, then we draw 2000 MCMC samples.
In [441]:
with pkpd_model:
    # obtain starting values via MAP
    start = find_MAP(fmin=scipy.optimize.fmin_powell)
    # draw 2000 posterior samples
    trace = sample(2000, start=start)
In [118]:
traceplot(trace, lines={k: v['mean'] for k, v in df_summary(trace).iterrows()})
Out[118]:
There are a number of ways to evaluate model fit. We will begin with the easiest...
First let's convert our traces to a dataframe for ease of use.
In [443]:
t_df = trace_to_dataframe(trace)
In [444]:
t_df.head()
Out[444]:
In [451]:
# Create a range of 'x' values to iterate over
i = np.arange(int(min(cobs)), int(max(cobs)))
# Extract the mean values for the model parameters from the trace dataframe
g = t_df['gamma'].mean()
e50 = t_df['ec50'].mean()
em = t_df['emax'].mean()
# Recreate the 'y' values of fxa
y = (em * (i**g)) / ((e50**g) + (i**g))
# Then plot everything
plt.figure(num=None, figsize=(8, 8), facecolor='w', edgecolor='k')
plt.scatter(cobs, fxa_in, alpha=0.2)
plt.plot(i, y, 'r')
Out[451]:
This way is okay, but it doesn't really give any feel for the uncertainty we have about fxa.
You can add in the percentile values for each model parameter and then replot, but these do not reflect the credible intervals for fxa. Take a look at the next plot, where this is done, and you can see why...
In [457]:
# Create a range of 'x' values to iterate over
i = np.arange(int(min(cobs)), int(max(cobs)))
# Extract the mean values for the model parameters from the trace dataframe
g = t_df['gamma'].mean()
e50 = t_df['ec50'].mean()
em = t_df['emax'].mean()
# Recreate the 'y' values of fxa
y = (em * (i**g)) / ((e50**g) + (i**g))
# Then plot everything
plt.figure(num=None, figsize=(8, 8), facecolor='w', edgecolor='k')
plt.scatter(cobs, fxa_in, alpha=0.2)
plt.plot(i, y, 'r')
# Extract the 5th-percentile values for the model parameters from the trace dataframe
g = t_df['gamma'].quantile(q=0.05)
e50 = t_df['ec50'].quantile(q=0.05)
em = t_df['emax'].quantile(q=0.05)
# Recreate the 'y' values of fxa
y = (em * (i**g)) / ((e50**g) + (i**g))
# Then plot the 5th percentile
plt.plot(i, y, 'g')
# Extract the 95th-percentile values for the model parameters from the trace dataframe
g = t_df['gamma'].quantile(q=0.95)
e50 = t_df['ec50'].quantile(q=0.95)
em = t_df['emax'].quantile(q=0.95)
# Recreate the 'y' values of fxa
y = (em * (i**g)) / ((e50**g) + (i**g))
# Then plot the 95th percentile
plt.plot(i, y, 'g')
Out[457]:
Notice how the quantile lines are not representative of the uncertainty in fxa, and certainly do not reflect a 90% credible interval. Because fxa is a nonlinear function of emax, ec50 and gamma, quantiles of the individual parameters do not translate into quantiles of the predicted curve.
What about plotting all the traces? We could try...
In [459]:
# Create a range of 'x' values to iterate over
i = np.arange(int(min(cobs)), int(max(cobs)))
# Extract the mean values for the model parameters from the trace dataframe
g = t_df['gamma'].mean()
e50 = t_df['ec50'].mean()
em = t_df['emax'].mean()
# Recreate the 'y' values of fxa
y = (em * (i**g)) / ((e50**g) + (i**g))
# Then plot everything
plt.figure(num=None, figsize=(8, 8), facecolor='w', edgecolor='k')
plt.scatter(cobs, fxa_in, alpha=0.2)
plt.plot(i, y, 'r')
# Overlay the curve implied by every posterior sample
for point in trace:
    g = point['gamma']
    e50 = point['ec50']
    em = point['emax']
    y = (em * (i**g)) / ((e50**g) + (i**g))
    plt.plot(i, y, 'b', alpha=0.1)
Nope, that's not working either! So how might we proceed?
This code is a modified version of a function I found in Jon Sedar's excellent blog post about Bayesian modelling, found here:
http://blog.applied.ai/bayesian-inference-with-pymc3-part-1/
In [119]:
def plot_posterior_cr(mdl, trc, rawdata, xlims, npoints=1000):
    '''
    Convenience fn: plot the posterior predictive credible regions from mdl given trc
    '''
    x = np.linspace(xlims[0], xlims[1], npoints)
    ## extract traces
    trc_sd = trace_to_dataframe(trc)['sigma']
    trc_gamma = np.array(trace_to_dataframe(trc)[['gamma']])
    trc_ec50 = np.array(trace_to_dataframe(trc)[['ec50']])
    trc_emax = np.array(trace_to_dataframe(trc)[['emax']])
    ## recreate the likelihood:
    ## fxaMean = (emax*(cobs**gamma)) / ((ec50**gamma) + (cobs**gamma))
    cobs_gamma = x ** trc_gamma
    numerator = trc_emax * cobs_gamma
    denominator = cobs_gamma + trc_ec50 ** trc_gamma
    like_mu = (numerator / denominator).T
    like_sd = np.tile(trc_sd.T, (npoints, 1))
    like = np.random.normal(like_mu, like_sd)
    ## Calculate credible regions and plot over the datapoints
    dfp = pd.DataFrame(np.percentile(like, [2.5, 25, 50, 75, 97.5], axis=1).T,
                       columns=['025', '250', '500', '750', '975'])
    dfp['x'] = x
    pal = sns.color_palette('Purples')
    f, ax1d = plt.subplots(1, 1, figsize=(7, 7))
    ax1d.fill_between(dfp['x'], dfp['025'], dfp['975'], alpha=0.5,
                      color=pal[1], label='CR 95%')
    ax1d.fill_between(dfp['x'], dfp['250'], dfp['750'], alpha=0.4,
                      color=pal[4], label='CR 50%')
    ax1d.plot(dfp['x'], dfp['500'], alpha=0.5, color=pal[5], label='Median')
    _ = plt.legend()
    _ = ax1d.set_xlim(xlims)
    _ = sns.regplot(x='cobs', y='fxa.inh.obs', data=rawdata, fit_reg=False,
                    scatter_kws={'alpha': 0.8, 's': 80, 'lw': 2, 'edgecolor': 'w'}, ax=ax1d)

xlims = (mean_df['cobs'].min() - np.ptp(mean_df['cobs']) / 10,
         mean_df['cobs'].max() + np.ptp(mean_df['cobs']) / 10)
plot_posterior_cr(pkpd_model, trace, mean_df, xlims)
This works well, and we get a great plot of credible values for fxa over a range of cobs, but it again relies upon recreating the model. Specifically, the model needs to be recreated in a form that is amenable to matrix calculations. This is possible if you are a numpy whizz, but not for me of only meagre brain.
We can instead use the values within the fxa_pred traces to evaluate the credible intervals and model fit for fxa without having to rebuild the model and iterate over it with a range of new x values as we did above.
Much simpler and less cumbersome/brain-achey.
Let's pull out all the fxa_pred columns from the dataframe. Each column holds the traces obtained for one of the cobs values passed in to the model. So the first fxa_pred column, "fxa_pred__0", contains 2000 estimates of fxa_pred for cobs[0] = 9.3812499999999961.
Examining just the first element: this is the value of the plasma concentration that was passed in...
In [460]:
cobs[0]
Out[460]:
This is the observed fxa value for that value of cobs
In [467]:
fxa_in[0]
Out[467]:
This is the mean predicted value for fxa from the traces for that value of cobs...
In [465]:
t_df['fxa_pred__0'].mean()
Out[465]:
So let's pull out all the columns, plot the predicted values for each cobs value, and create a dataframe for plotting...
In [468]:
filter_col = [col for col in list(t_df) if col.startswith('fxa_pred')]
In [469]:
col = pd.DataFrame()
to_col = pd.DataFrame()
for n, cols in enumerate(filter_col):
    to_col['fxa_pred'] = t_df[cols]
    to_col['cobs'] = cobs[n]
    col = pd.concat([col, to_col])
There is probably a much more elegant way of creating the dataframe for plotting!
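One option, as a sketch (assuming pandas' melt and the 'fxa_pred__N' column-naming convention produced by trace_to_dataframe; long_df and cobs_map are just illustrative names), would be:

# Melt the wide fxa_pred__N columns into long format, then map each
# column name back to the cobs value it was generated from
long_df = pd.melt(t_df[filter_col], var_name='pred_col', value_name='fxa_pred')
cobs_map = {c: cobs[int(c.split('__')[-1])] for c in filter_col}
long_df['cobs'] = long_df['pred_col'].map(cobs_map)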
In [470]:
plt.figure(figsize=(10, 10))
plt.scatter(col['cobs'], col['fxa_pred'], alpha=0.002, s=15, color='grey')
plt.scatter(cobs, fxa_in, alpha=1, color='red')
Out[470]:
This looks pretty good: we get a feel for the credible intervals for fxa_pred, and we can sort of hack it to create a 90% credible interval...
In [336]:
col = np.empty([1, 4])
for n, cols in enumerate(filter_col):
    a = cobs[n]
    b = t_df[cols].quantile(q=0.5)
    c = t_df[cols].quantile(q=0.95)
    d = t_df[cols].quantile(q=0.05)
    e = np.array([a, b, c, d]).reshape(1, 4)
    col = np.concatenate((col, e))
col = np.delete(col, (0), axis=0)
col = pd.DataFrame(col, columns=['cobs', 'median', 'upper', 'lower'])
col = col.sort_values(by=['cobs'])
In [337]:
plt.figure(figsize=(10, 10))
plt.plot(col['cobs'], col['median'], '-', color='blue', alpha=0.5)
plt.plot(col['cobs'], col['upper'], '--', color='blue', alpha=0.5)
plt.plot(col['cobs'], col['lower'], '--', color='blue', alpha=0.5)
plt.scatter(cobs, fxa_in, alpha=1, color='red')
Out[337]:
So the above was all very well, but it's not completely satisfactory. What if we wanted to evaluate the range of responses that we might expect to see for a plasma concentration that hadn't already been observed?
Well, thanks to a little poking around on stackoverflow I came across this post, which answered things perfectly... http://stackoverflow.com/questions/33249122/generating-predictions-from-inferred-parameters-in-pymc3
The following is an implementation of the reply on that thread, and it seems to work pretty nicely.
It utilises a theano shared variable: in the first instance we pass in our cobs array, and then when we want to create predictions we update the shared variable with the values we are interested in predicting.
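To see what a shared variable does in isolation, here is a tiny standalone illustration (toy values, independent of the model):

from theano import shared
import numpy as np

# A shared variable behaves like a constant inside the compiled graph,
# but its value can be swapped out after the graph is built.
x = shared(np.array([1.0, 2.0, 3.0]))
print(x.get_value())   # [ 1.  2.  3.]
x.set_value(np.array([10.0, 20.0]))
print(x.get_value())   # [ 10.  20.]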
In [472]:
from theano import shared
In [473]:
cobs_shared = shared(cobs)
In [474]:
pkpd_model = Model()
with pkpd_model:
    # Priors for unknown model parameters
    emax = Uniform('emax', lower=0, upper=100)
    ec50 = Lognormal('ec50', mu=0.01, tau=5)
    gamma = Uniform('gamma', lower=0, upper=10)
    sigma = Uniform('sigma', lower=0, upper=1000)
    # Expected value of outcome - the only change to the model is cobs_shared rather than cobs
    fxaMean = (emax * (cobs_shared**gamma)) / ((ec50**gamma) + (cobs_shared**gamma))
    # Likelihood (sampling distribution) of observations
    fxa = Normal('fxa', mu=fxaMean, sd=sigma, observed=fxa_in)
In [475]:
with pkpd_model:
    # obtain starting values via MAP
    start = find_MAP(fmin=scipy.optimize.fmin_powell)
    # draw 2000 posterior samples
    trace = sample(2000, start=start)
Now a little magic, lifted straight from the post
In [342]:
from collections import defaultdict
from pymc3 import modelcontext

def run_ppc(trace, samples=100, model=None):
    """Generate posterior predictive samples from a model given a trace.
    """
    if model is None:
        model = modelcontext(model)
    ppc = defaultdict(list)
    for idx in np.random.randint(0, len(trace), samples):
        param = trace[idx]
        for obs in model.observed_RVs:
            ppc[obs.name].append(obs.distribution.random(point=param))
    return ppc
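As an aside, pymc3 also ships a built-in sample_ppc (imported at the top of this notebook) that does essentially the same job as run_ppc. A minimal sketch, with the caveat that shape handling with shared variables may differ between pymc3 versions:

# Once cobs_shared has been set to the values of interest (see below),
# the built-in posterior predictive sampler should give equivalent draws
with pkpd_model:
    ppc_builtin = sample_ppc(trace, samples=200)
# ppc_builtin['fxa'] then holds the posterior predictive draws for fxa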
Now let's make an array of cobs values we would like the fxa posterior for...
In [344]:
test_cobs = np.array([50, 100, 200, 400, 600])
Set the shared variable with our test values...
In [345]:
cobs_shared.set_value(test_cobs)
Call the run_ppc function...
In [347]:
ppc = run_ppc(trace, model=pkpd_model, samples=200)
This creates a dictionary with a list of arrays; we can split these out into a dataframe for easy inspection and manipulation...
In [383]:
ppc_df = pd.DataFrame(ppc['fxa'], columns=test_cobs)
In [385]:
ppc_df.head()
Out[385]:
So here it is: we have all the values that we require for our dependent variable fxa, for the values of cobs we passed in.
In [387]:
ppc_df.describe()
Out[387]:
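For example, a 90% posterior predictive interval for each test concentration can be read straight off the draws:

# 5th and 95th percentiles of the posterior predictive draws,
# one column per test cobs value
ppc_df.quantile(q=[0.05, 0.95])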
And we can have crazy fun in seaborn looking at the distributions...
In [399]:
for c in ppc_df.columns:
    sns.distplot(ppc_df[c], rug=True)
or...
In [425]:
unstacked = pd.DataFrame(ppc_df.unstack()).reset_index()
unstacked.columns = ['dose', 'dose_index', 'dose_value']
g = sns.FacetGrid(unstacked, col = 'dose', col_wrap=2, size=5)
g = g.map(sns.distplot, "dose_value", rug=True)
In [438]:
plt.figure(figsize=(10, 10))
plt.scatter(cobs, fxa_in, alpha=1, color='red')
plt.scatter(unstacked['dose'], unstacked['dose_value'], alpha=0.02)
for c in ppc_df.columns:
    plt.scatter(c, ppc_df[c].mean(), color='b', s=50)